09. Text: Dummy Variables

The Math Behind Dummy Variables

In the last video, you were introduced to the way that categorical variables are converted into dummy variables so they can be added to your linear models.

Then, you will need to drop one of the dummy columns in order to make your matrices full rank.

If you remember back to the closed form solution for the coefficients in regression, the estimate is \hat{\beta} = (X'X)^{-1}X'y.
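To make the closed form solution concrete, here is a minimal sketch in NumPy, using a small synthetic dataset (the numbers are made up for illustration):

```python
import numpy as np

# Synthetic, noise-free data generated from y = 2 + 3*x
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])  # first column is the intercept
y = np.array([5.0, 8.0, 11.0, 14.0])

# Closed form OLS estimate: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # approximately [2. 3.]
```

The inverse of (X'X) only exists here because the two columns of X are linearly independent, which is the full rank condition discussed next.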

In order to take the inverse of (X'X), the matrix X must be full rank. That is, all of the columns of X must be linearly independent.
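You can see the rank problem directly with NumPy. In this hypothetical example, a categorical variable with three levels (A, B, C) is encoded with all three dummy columns alongside an intercept; the dummies sum to the intercept column, so the matrix is not full rank:

```python
import numpy as np

# Intercept plus ALL three dummy columns for levels A, B, C (hypothetical data)
intercept = np.ones(6)
dummy_A = np.array([1, 1, 0, 0, 0, 0])
dummy_B = np.array([0, 0, 1, 1, 0, 0])
dummy_C = np.array([0, 0, 0, 0, 1, 1])

X_full = np.column_stack([intercept, dummy_A, dummy_B, dummy_C])
# dummy_A + dummy_B + dummy_C == intercept, so one column is redundant
print(np.linalg.matrix_rank(X_full))     # 3, but X_full has 4 columns

# Dropping one dummy column makes the columns linearly independent again
X_dropped = np.column_stack([intercept, dummy_A, dummy_B])
print(np.linalg.matrix_rank(X_dropped))  # 3 == number of columns, so full rank
```

With the rank-deficient matrix, (X'X) is singular and the closed form solution is not well defined, which is why software can return unstable results.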

If you do not drop one of the columns (from the model, not from the dataframe) when creating the dummy variables, your solution is unstable and results from Python are unreliable. You will see an example of what happens if you do not drop one of the dummy columns in the next concept.

The takeaway is … when you create dummy variables using 0, 1 encodings, you always need to drop one of the columns from the model to make sure your matrices are full rank (and that the solutions you get from Python are reliable).

The reason for this is linear algebra. Specifically, in order to invert matrices, a matrix must be full rank (that is, all the columns need to be linearly independent). Therefore, you need to drop one of the dummy columns, to create linearly independent columns (and a full rank matrix).
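In practice, pandas can drop the baseline column for you. Here is a small sketch using a hypothetical `neighborhood` column; `drop_first=True` drops the first level, which becomes the baseline category absorbed by the intercept:

```python
import pandas as pd

# Hypothetical dataframe with one categorical column
df = pd.DataFrame({'neighborhood': ['A', 'B', 'C', 'A', 'C'],
                   'price': [210, 185, 250, 205, 260]})

# drop_first=True drops the dummy column for level 'A',
# leaving linearly independent columns for the model
dummies = pd.get_dummies(df['neighborhood'], drop_first=True)
df = pd.concat([df.drop(columns='neighborhood'), dummies], axis=1)
print(df.columns.tolist())  # ['price', 'B', 'C']
```

The coefficients on `B` and `C` are then interpreted relative to the dropped baseline level `A`.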